IN THE UNITED STATES PATENT AND TRADEMARK OFFICE
UTILITY PATENT APPLICATION TRANSMITTAL 
UNDER 37 CFR 1.53(b) 






Box Patent Application 

Assistant Commissioner for Patents 

Washington, DC 20231 


Attorney Docket No.:    STL9-2000-0055
Inventor(s):            John R. Ehrman
Express Mail Label No.: EL290559025US
Filing Date:            July 10, 2000

Title of Application: METHOD OF, SYSTEM FOR, AND COMPUTER PROGRAM PRODUCT FOR CREATING AND
CONVERTING TO UNICODE DATA FROM SINGLE BYTE CHARACTER SETS, DOUBLE BYTE CHARACTER SETS, OR
MIXED CHARACTER SETS COMPRISING BOTH SINGLE BYTE AND DOUBLE BYTE CHARACTER SETS



Transmitted with the patent application are the following: 

47 Page(s) Specification, Claims and Abstract
__ Page(s) Formal drawings
 2 Page(s) Declaration and Power of Attorney
 3 Page(s) Assignment of the Invention to International Business Machines Corporation (inc. Rec Cover Sheet in Duplicate)
__ Page(s) Information Disclosure Statement (IDS/PTO 1449) (copies of citations not included in number of pages)
__ Copies of IDS citations
 X Return Receipt Postcard (MPEP 503).



Fee Calculation:

                             Claims        Extra    Rate          Fees
Basic Fee                                                         $690.00
Total Claims                 24 - 20 =     4        x $18.00      $ 72.00
Independent Claims            3 -  3 =     0        x $78.00      n/a
Multiple Dependent Claims                           + $260.00
TOTAL                                                             $762.00



X Please charge my Deposit Account No. 09-0460 in the amount of $762.00. A duplicate copy of this sheet is attached.
X The Commissioner is hereby authorized to charge payment of the following fees associated with this communication or credit any
overpayment to Deposit Account 09-0460. A duplicate copy of this sheet is attached.
X Any filing fees under 37 CFR 1.16 for the presentation of extra claims. 
X Any patent application processing fees under 37 CFR 1.17. 

Respectfully submitted,
John R. Ehrman
Registration No. #33,123
IBM Corporation
Intellectual Property Law
555 Bailey Avenue (J46/G467)
San Jose, CA 95141-1003
Telephone (408) 463-5673

EXPRESS MAIL CERTIFICATE

I hereby certify that the above paper/fee is being deposited
with the United States Postal Service "Express Mail Post Office to Addressee"
service under 37 CFR 1.10 on the date indicated below and is addressed
to the Assistant Commissioner for Patents, Washington, DC 20231.

Date of Deposit: July 10, 2000
Person Mailing paper/fee: Jeanette Berry Durbin
Signature: [illegible]




SPECIFICATION 

IBM Docket No. STL9-2000-0055 
TO ALL WHOM IT MAY CONCERN: 

BE IT KNOWN that I, John R. Ehrman of Sunnyvale, California and citizen of the United
States have invented new and useful improvements in



METHOD OF, SYSTEM FOR, AND COMPUTER PROGRAM PRODUCT FOR 
CREATING AND CONVERTING TO UNICODE DATA FROM SINGLE BYTE 
CHARACTER SETS, DOUBLE BYTE CHARACTER SETS, OR MIXED 
CHARACTER SETS COMPRISING BOTH SINGLE BYTE AND DOUBLE BYTE 

CHARACTER SETS 

of which the following is a specification: 

CROSS-REFERENCE TO RELATED APPLICATIONS

Application Serial Number ______, filed concurrently herewith on July 10, 2000
for DATA STRUCTURE FOR CREATING, SCOPING, AND CONVERTING TO
UNICODE DATA FROM SINGLE BYTE CHARACTER SETS, DOUBLE BYTE
CHARACTER SETS, OR MIXED CHARACTER SETS COMPRISING BOTH
SINGLE BYTE AND DOUBLE BYTE CHARACTER SETS (IBM Docket STL9-2000-0068),
currently co-pending, and assigned to the same assignee as the present invention; and

Application Serial Number ______, filed concurrently herewith on July 10, 2000
for METHOD OF, SYSTEM FOR, AND COMPUTER PROGRAM PRODUCT FOR
SCOPING THE CONVERSION OF UNICODE DATA FROM SINGLE BYTE
CHARACTER SETS, DOUBLE BYTE CHARACTER SETS, OR MIXED
CHARACTER SETS COMPRISING BOTH SINGLE BYTE AND DOUBLE BYTE
CHARACTER SETS (IBM Docket STL9-2000-0069), currently co-pending, and assigned to
the same assignee as the present invention.

The foregoing copending applications are incorporated herein by reference.

A portion of the Disclosure of this patent document contains material which is subject
to copyright protection. The copyright owner has no objection to the facsimile reproduction by
anyone of the patent document or the patent disclosure, as it appears in the Patent and
Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates in general to coded character sets for representing
characters in a computer program, and more particularly to the creation of Unicode characters by
converting from non-Unicode characters.

2. Description of the Related Art

Unicode is a new internationally standardized data encoding for character data which
allows computers to exchange and process character data in any natural language text. Its most
common usage is in representing each character as a sixteen-bit number. This is sometimes
called a "double-byte" data representation as a byte contains eight bits.

Most existing computer hardware and software represent specific sets of characters in
an eight-bit code, of which ASCII (American National Standard Code for Information
Interchange) and EBCDIC (Extended Binary-Coded Decimal Interchange Code) are typical
examples. In such an eight-bit representation (also known as a single-byte representation), the
limit of two-hundred-fifty-six (256) unique numeric values imposes a restriction on the set of
distinct characters that may be encoded using the two-hundred-fifty-six distinct values. Thus, it
is necessary to define different sets of encodings for each desired set of characters.

The chosen set of characters is called a "Character Set". Each member of the character
set can be assigned a unique eight-bit numeric value ("Code Point") from the set of the
two-hundred-fifty-six distinct values (Code Points). A group of assignments of characters and
control function meanings to all available code points is called a "Code Page"; for example, the
assignments of characters and meanings to the two-hundred-fifty-six code points (0 through
255) of an 8-bit code set is a Code Page. The combination of a specific set of characters and a
specific set of numeric value assignments is called a "Coded Character Set". To distinguish
among the many different assignments of characters to codings, each Coded Character Set is
assigned an individual identification number called a "Coded Character Set ID" (CCSID).
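
As a small illustration of why the code page must be known before any conversion, the following Python sketch decodes the same three byte values under two different eight-bit code pages. The codec names `cp037` (an EBCDIC code page) and `latin-1` (an ASCII-based code page) are Python's own names, chosen here only as examples; they are assumptions and do not come from this specification.

```python
# The same code points (byte values) stand for different characters
# depending on which code page is used to interpret them.
raw = bytes([0xC1, 0xC2, 0xC3])

as_ebcdic = raw.decode("cp037")    # EBCDIC code page 037 (example choice)
as_latin1 = raw.decode("latin-1")  # ISO 8859-1, an ASCII-based code page

print(as_ebcdic)        # ABC
print(as_latin1)        # three accented capital A letters

# Conversely, one character has different code points in each code page.
print("A".encode("cp037").hex().upper())  # C1
print("A".encode("ascii").hex().upper())  # 41
```

The point of the sketch is only that a byte string carries no meaning without its code page, which is why each coded character set needs an identifier such as a CCSID.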

In situations involving ideographic scripts such as Chinese, Japanese, or Korean, a
hybrid or mixed representation of characters is sometimes used. Because the number of
ideographic characters greatly exceeds the two-hundred-fifty-six possible representations
available through the use of an eight-bit encoding, a special sixteen-bit encoding may be used
instead. To manage such sixteen-bit representations in computing systems and devices built for
eight-bit representations, two special eight-bit character codes are reserved and used in the
eight-bit-character byte stream to indicate a change of alphabet representation. Typically, a
string of characters will contain eight-bit characters in a single-byte representation. When the
first of the two special character codes (commonly called a "Shift-Out" character) is
encountered indicating a switch of alphabets, the bytes subsequent to the Shift-Out character
are interpreted as double-byte pairs encoded in the special sixteen-bit double-byte encoding.
At the end of the double-byte ideographic string, the other special eight-bit character code
(commonly called a "Shift-In" character) is inserted to indicate that the following eight-bit
bytes are to be interpreted as single-byte characters, as were those characters preceding the
"Shift-Out" character. This hybrid representation is sometimes also called a "double-byte
character set" (DBCS) representation. When such DBCS strings are mixed with single-byte
character set (SBCS) characters, the representation is sometimes called a "mixed SBCS/DBCS"
representation.

Ideographic characters may also be represented as sixteen-bit characters in strings
without any SBCS characters other than the special initial "Shift-Out" and final "Shift-In"
character codes if they are used in a context where it is known that there are no mixtures of
eight-bit characters and sixteen-bit characters. Such usage is sometimes called "pure DBCS".
The Shift-Out and Shift-In codes are still required as the text of the remainder of the program
may use single-byte encodings.


To illustrate, assume that the "Shift-Out" character is represented by the character '<'
and that the "Shift-In" character is represented by the character '>'. Then each of the three
representations just described may be written as strings of these forms:

    'abcDEF'        SBCS string
    'AB<wxyz>CD'    mixed SBCS/DBCS string
    '<wxyz>'        pure DBCS string

The actual computer storage representation of each of these three character formats
would generally be similar to the following representations. For example, the SBCS string
would generally appear in storage as follows:

                +---+---+---+---+---+---+
    'abcDEF'    | a | b | c | D | E | F |
                +---+---+---+---+---+---+
                Six bytes, one byte per character

The hexadecimal encoding of this string in a standard representation may appear as:

    +----+----+----+----+----+----+
    | 81 | 82 | 83 | C4 | C5 | C6 |
    +----+----+----+----+----+----+

After translation to Unicode, the same characters may be represented by the following bytes
(shown in hexadecimal encoding):

    +-------+-------+-------+-------+-------+-------+
    | 00 61 | 00 62 | 00 63 | 00 44 | 00 45 | 00 46 |
    +-------+-------+-------+-------+-------+-------+
    Twelve bytes, two bytes per character
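
The SBCS example can be reproduced with a short Python sketch. It assumes that the "standard representation" is EBCDIC code page 037 (Python codec `cp037`) and that the sixteen-bit Unicode form is big-endian UTF-16 (`utf-16-be`); both codec choices are assumptions made for illustration only.

```python
source = "abcDEF"

# Single-byte source representation: one byte per character.
sbcs = source.encode("cp037")
print(sbcs.hex(" ").upper())      # 81 82 83 C4 C5 C6

# Convert from the source code page to sixteen-bit Unicode.
unicode16 = sbcs.decode("cp037").encode("utf-16-be")
print(unicode16.hex(" ").upper()) # 00 61 00 62 00 63 00 44 00 45 00 46

print(len(sbcs), len(unicode16))  # 6 12
```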

Similarly, the computer storage representation of a mixed SBCS/DBCS string may
generally appear as follows, where 'wxyz' represents the four bytes needed to encode the two
ideographic DBCS characters between the Shift-Out and Shift-In characters, and the '??' strings
indicate the specific encodings assigned to the representations of the DBCS characters:

                    +---+---+---+-------+-------+---+---+---+
    'AB<wxyz>CD'    | A | B | < | ?? ?? | ?? ?? | > | C | D |
                    +---+---+---+-------+-------+---+---+---+
                    Ten bytes

The hexadecimal encoding of this string in a standard representation may appear as follows
(wherein the Shift-Out and Shift-In characters have encodings X'0E' and X'0F' respectively):

    +----+----+----+----+----+----+----+----+----+----+
    | C1 | C2 | 0E | ?? | ?? | ?? | ?? | 0F | C3 | C4 |
    +----+----+----+----+----+----+----+----+----+----+

When translated to Unicode, the same characters may be represented by these bytes (shown
in hexadecimal encoding):

    +-------+-------+-------+-------+-------+-------+
    | 00 41 | 00 42 | ?? ?? | ?? ?? | 00 43 | 00 44 |
    +-------+-------+-------+-------+-------+-------+
    Twelve bytes, two bytes per character

Note that the Shift-Out and Shift-In characters have been removed, as they are not necessary in
the Unicode representation.
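
The handling of Shift-Out and Shift-In described above can be sketched as a small state machine. This is a minimal Python illustration, not the method of the preferred embodiment: the function name `mixed_to_utf16`, the use of codec `cp037` for the single-byte portion, and in particular the two-entry DBCS table `DEMO_DBCS` with its byte values are all hypothetical.

```python
SHIFT_OUT = 0x0E  # switches from single-byte to double-byte interpretation
SHIFT_IN = 0x0F   # switches back to single-byte interpretation

def mixed_to_utf16(data: bytes, sbcs_codec: str, dbcs_table: dict) -> bytes:
    """Convert a mixed SBCS/DBCS byte string to big-endian 16-bit Unicode."""
    out = bytearray()
    in_dbcs = False
    i = 0
    while i < len(data):
        b = data[i]
        if not in_dbcs and b == SHIFT_OUT:
            in_dbcs = True           # shift codes produce no output
            i += 1
        elif in_dbcs and b == SHIFT_IN:
            in_dbcs = False
            i += 1
        elif in_dbcs:
            pair = data[i:i + 2]     # DBCS characters occupy two bytes
            if len(pair) < 2:
                raise ValueError("odd number of bytes in DBCS section")
            out += dbcs_table[pair].encode("utf-16-be")
            i += 2
        else:
            out += bytes([b]).decode(sbcs_codec).encode("utf-16-be")
            i += 1
    if in_dbcs:
        raise ValueError("missing Shift-In")
    return bytes(out)

# Hypothetical DBCS mapping: these byte pairs and targets are invented.
DEMO_DBCS = {b"\x41\x42": "\u4e00", b"\x41\x43": "\u4e8c"}

mixed = b"\xC1\xC2\x0E\x41\x42\x41\x43\x0F\xC3\xC4"   # ten bytes
result = mixed_to_utf16(mixed, "cp037", DEMO_DBCS)
print(result.hex(" ").upper())  # 00 41 00 42 4E 00 4E 8C 00 43 00 44
```

The same function handles a pure DBCS string, since that is just a mixed string with nothing outside the shift codes.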

For the third type of character string containing pure DBCS characters, the computer
storage representation may appear as follows:

                +---+-------+-------+---+
    '<wxyz>'    | < | ?? ?? | ?? ?? | > |    Six bytes
                +---+-------+-------+---+

The hexadecimal encoding of this string in a standard representation may appear as follows
(wherein the Shift-Out and Shift-In characters have encodings X'0E' and X'0F' respectively):

    +----+----+----+----+----+----+
    | 0E | ?? | ?? | ?? | ?? | 0F |
    +----+----+----+----+----+----+

When translated to Unicode, the same characters would be represented by these bytes
(shown in their hexadecimal encoding):

    +-------+-------+
    | ?? ?? | ?? ?? |
    +-------+-------+
    Four bytes, two bytes per character

In typical usage, many coded character sets are used to represent the characters of
various national languages. As computer applications evolve to support a greater range of
national languages, there is a corresponding requirement to encompass a great multiplicity of
"alphabets". For example, a software supplier in England may provide an account
management program to a French company with a subsidiary in Belgium whose customers
include people with names and addresses in Danish, Dutch, French, Flemish, and German
alphabets. If the program creates billings or financial summaries, it must also cope with a
variety of currency symbols. Using conventional technology, it may be difficult, or even
impossible, to accommodate such a variety of alphabets and characters using a single eight-bit
coded character set.

In other applications, a program may be required to present messages to its users in any
of several selectable national languages (this is often called "internationalization"). Creating
the message texts requires that the program's suppliers be able to create the corresponding
messages in each of the supported languages, which requires special techniques for handling a
multiplicity of character sets in a single application.

Unicode offers a solution to the character encoding problem, by providing a single
sixteen-bit representation of the characters used in most applications. However, most existing
computer equipment creates, manages, displays, or prints only eight-bit single-byte data
representations. In order to simplify the creation of double-byte Unicode data, there is a need
for ways to allow computer users to enter their data in customary single-byte, mixed
SBCS/DBCS, and pure DBCS formats, and then have it converted automatically to the
double-byte Unicode representation.

SUMMARY OF THE INVENTION

The present invention comprises a method, system, article of manufacture, and a
computer program product for representing characters in a computer program, and more
particularly for the creation of Unicode characters by converting from non-Unicode characters.
A preferred embodiment of the present invention provides methods for specifying the types of
constants whose character values are to be converted to Unicode; for specifying which code
page or pages are used for specifying the character encodings used in the source program for
writing the character strings to be converted to Unicode; and for performing
conversions from SBCS, mixed SBCS/DBCS, and pure DBCS character strings to Unicode. A
syntax suitable for specifying character data conversion from SBCS, mixed SBCS/DBCS, and
pure DBCS representations to Unicode utilizes an extension to the conventional constant
subtype notation.

In converting the nominal value data to Unicode, currently relevant SBCS
and DBCS code pages are used, as specified by three levels or scopes derived from
global options, from local AOPTIONS statement specifications, or from constant-specific
modifiers. Global code page specifications apply to the entire source program. These global
specifications allow a programmer to declare the source-program code page or code pages just
once. These specifications then apply to all constants containing a request for conversion to
Unicode. Local code page specifications apply to all subsequent source-program statements.
These local specifications allow the programmer to create groups of statements containing
Unicode conversion requests, all of which use the same code page or code pages for their
source-character encodings. Code page specifications that apply to individual constants allow a
very detailed level of control over the source data encodings to be used for Unicode
conversion.

The conversion of source data to Unicode may be implemented inherently in the
translator (assembler, compiler, or interpreter), wherein it recognizes and parses the complete
syntax of the statement in which the constant or constants are specified, and performs the
requested conversion. Alternatively, an external function may be invoked by a variety of
source language syntaxes which parses as little or as much of the source statement as its
implementation provides, and returns the converted value for inclusion in the generated
machine language of the object program. Alternatively, the conversion may be provided by the
translator's macro instruction definition facility.

One aspect of a preferred embodiment of the present invention provides for the
specification of the types of constants whose character values are to be converted to Unicode.

Another aspect of a preferred embodiment of the present invention provides for the
specification of which code page or pages are used for specifying the character encodings used
in the source program for writing the character strings to be converted to Unicode.

Another aspect of a preferred embodiment of the present invention performs
conversions from SBCS, mixed SBCS/DBCS, and pure DBCS character strings to Unicode.

Another aspect of a preferred embodiment of the present invention provides a syntax
suitable for specifying character data conversion from SBCS, mixed SBCS/DBCS, and pure
DBCS representations to Unicode utilizing an extension to the conventional constant subtype
notation.

Another aspect of a preferred embodiment of the present invention converts
nominal value data to Unicode using currently relevant SBCS and DBCS code pages as specified by a
level or scope.

Another aspect of a preferred embodiment of the present invention provides a global
level or scope comprising a global code page specification which applies to an entire source
program.

Another aspect of a preferred embodiment of the present invention provides a local
level or scope comprising a local code page specification which applies to all subsequent
source-program statements.

Another aspect of a preferred embodiment of the present invention provides an
individual constant level or scope comprising a code page specification that applies to an
individual constant.
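
The three scopes can be pictured as a simple lookup. The Python sketch below is illustrative only: the class name `CodePageScopes`, its fields, and the code page names are hypothetical, and the precedence shown (a constant-specific setting overrides a local one, which overrides the global one) is an assumption about how the levels combine rather than a statement taken from this summary.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CodePageScopes:
    """Hypothetical holder for the global and local code page settings."""
    global_cp: str                   # declared once for the whole program
    local_cp: Optional[str] = None   # set by a local statement, if any

    def resolve(self, constant_cp: Optional[str] = None) -> str:
        # Assumed precedence: most specific setting wins.
        if constant_cp is not None:
            return constant_cp       # constant-specific modifier
        if self.local_cp is not None:
            return self.local_cp     # local specification
        return self.global_cp        # global option

scopes = CodePageScopes(global_cp="cp037")
print(scopes.resolve())              # cp037 (global only)

scopes.local_cp = "cp500"            # applies to subsequent statements
print(scopes.resolve())              # cp500
print(scopes.resolve("cp1047"))      # cp1047 (this constant only)
```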

A preferred embodiment of the present invention has the advantage of providing ease of
Unicode data creation: data can be entered into a program using familiar and customary
techniques, and in the user's own language and preferred character sets, without having to know
any details of SBCS, DBCS, or Unicode character representations or encodings.

A preferred embodiment of the present invention has the further advantage of providing
an ability to handle multiple single-byte and double-byte input data encodings, each specific to
a national language or a national alphabet. Such input data may be written in several
convenient forms, such as SBCS, mixed SBCS/DBCS, and pure DBCS.

A preferred embodiment of the present invention has the further advantage of providing
a variety of scopes for specifying controls over source data representations and encodings,
such that the user has complete control over the range of these specifications, ranging from
global (applying to all requested conversions in the entire program), through local (applying to a
range of statements containing data to be converted), to individual or constant-specific (applying
to a single instance of data to be converted).

A preferred embodiment of the present invention has the further advantage of providing
an open-ended design allowing easy addition of supported character sets, by simply providing
additional Mapping Tables for each supported character set, and without any need to modify
the internal logic of the translator (assembler, compiler, or interpreter) to be cognizant of such
added character sets and tables.
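
The open-ended, table-driven design can be sketched as follows. This Python illustration is an assumption about the general idea only: the registry `MAPPING_TABLES`, the helper names, and the use of Python codecs to build demonstration tables are all hypothetical and are not the Mapping Table structure of the preferred embodiment.

```python
# Hypothetical registry: code page name -> list of 256 Unicode characters,
# indexed by the single-byte code point.
MAPPING_TABLES = {}

def register_table(name: str, table: list) -> None:
    """Add support for a character set by supplying its table."""
    if len(table) != 256:
        raise ValueError("an SBCS mapping table needs 256 entries")
    MAPPING_TABLES[name] = table

def convert(data: bytes, code_page: str) -> str:
    """Generic converter: it knows nothing about any specific code page."""
    table = MAPPING_TABLES[code_page]
    return "".join(table[b] for b in data)

def table_from_codec(codec: str) -> list:
    """Demo helper: build a table from one of Python's 8-bit codecs."""
    return [bytes([b]).decode(codec) for b in range(256)]

register_table("EBCDIC-037", table_from_codec("cp037"))
print(convert(b"\xC1\xC2", "EBCDIC-037"))   # AB

# Supporting another character set requires only another table;
# convert() itself is unchanged.
register_table("LATIN-1", table_from_codec("latin-1"))
print(convert(b"AB", "LATIN-1"))            # AB
```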

A preferred embodiment of the present invention has the further advantage of having no
dependence on operating system environments or run-time conversion services, which may or
may not be available in the environment in which character data in the source programs are
being converted to Unicode and translated to machine language.

A preferred embodiment of the present invention has the further advantage of providing
a special language syntax specifying constants to be converted to Unicode, creating no conflicts
with existing applications. This syntax is also a natural and intuitively familiar extension of the
existing syntax for specifying character constants.

A preferred embodiment of the present invention has the further advantage of having no
need to prepare or accept programs written using Unicode characters, and no need for
special Unicode-enabled input/output devices or mapping software, because of the ease of data
creation and the variety of data formats described above.

A preferred embodiment of the present invention has the further advantage of providing
an ability to implement conversions in multiple ways to provide flexibility, including
implementations in the translator itself ("native" implementation), or by using macro or
preprocessor instructions, or by utilizing the translator's support for externally-defined and
externally-written functions.

A preferred embodiment of the present invention has the further advantage of providing
an ability to support normal sixteen-bit Unicode and Unicode UTF-8 character formats as the
results of converting any of the source data formats described above.
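
The two target formats differ only in how the same Unicode code points are serialized. The following Python sketch shows this for three arbitrarily chosen characters; treating "normal sixteen-bit Unicode" as big-endian UTF-16 is an assumption made for the example.

```python
# Three sample characters: U+0041, U+00E9, U+4E00 (chosen arbitrarily).
text = "A\u00e9\u4e00"

utf16 = text.encode("utf-16-be")  # fixed two bytes per character here
utf8 = text.encode("utf-8")       # one to three bytes per character here

print(utf16.hex(" ").upper())     # 00 41 00 E9 4E 00
print(utf8.hex(" ").upper())      # 41 C3 A9 E4 B8 80
```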

BRIEF DESCRIPTION OF THE DRAWINGS

For a more complete understanding of the present invention and the advantages thereof,
reference is now made to the Description of the Preferred Embodiment in conjunction with the
attached Drawings, in which:

Figure 1 is a block diagram of a distributed computer system used in performing the
method of the present invention, forming part of the apparatus of the present invention, and
which may use the article of manufacture comprising a computer-readable storage medium
having a computer program embodied in said medium which may cause the computer system
to practice the present invention;

Figure 2 is a block diagram of a mapping table data structure preferred in carrying out a
preferred embodiment of the present invention;

Figure 3 and Figure 4 are flowcharts of method steps preferred in carrying out a
preferred embodiment of the present invention; and

Figures 5, 6, and 7 are listings of computer program code which implements the
method steps preferred in carrying out a preferred embodiment of the present invention.

DESCRIPTION OF THE PREFERRED EMBODIMENT

Referring first to Figure 1, there is depicted a graphical representation of a data
processing system 8, which may be utilized to implement the present invention. As may be
seen, data processing system 8 may include a plurality of networks, such as Local Area
Networks (LAN) 10 and 32, each of which preferably includes a plurality of individual
computers 12 and 30, respectively. Of course, those skilled in the art will appreciate that a
plurality of Intelligent Work Stations (IWS) coupled to a host processor may be utilized for
each such network. Each said network may also consist of a plurality of processors coupled via
a communications medium, such as shared memory, shared storage, or an interconnection
network. As is common in such data processing systems, each individual computer may be
coupled to a storage device 14 and/or a printer/output device 16 and may be provided with a
pointing device such as a mouse 17.

The data processing system 8 may also include multiple mainframe computers, such as
mainframe computer 18, which may be preferably coupled to LAN 10 by means of
communications link 22. The mainframe computer 18 may also be coupled to a storage device
20 which may serve as remote storage for LAN 10. Similarly, LAN 10 may be coupled via
communications link 24 through a sub-system control unit/communications controller 26 and
communications link 34 to a gateway server 28. The gateway server 28 is preferably an IWS
which serves to link LAN 32 to LAN 10.

With respect to LAN 32 and LAN 10, a plurality of documents or resource objects may
be stored within storage device 20 and controlled by mainframe computer 18, as resource
manager or library service for the resource objects thus stored. Of course, those skilled in the
art will appreciate that mainframe computer 18 may be located a great geographic distance
from LAN 10 and similarly, LAN 10 may be located a substantial distance from LAN 32. For
example, LAN 32 may be located in Belgium while LAN 10 may be located within England
and mainframe computer 18 may be located in New York.

Software program code which employs the present invention is typically stored in the
memory of a storage device 14 of a stand alone workstation or LAN server from which a
developer may access the code for distribution purposes. The software program code may be
embodied on any of a variety of known media for use with a data processing system, such as a
diskette or CD-ROM, or may be distributed to users from a memory of one computer system
over a network of some type to other computer systems for use by users of such other systems.
Such techniques and methods for embodying software code on media and/or distributing
software code are well-known and will not be further discussed herein.

As will be appreciated upon reference to the foregoing, it may be desirable for a user to
develop a multi-lingual or multi-alphabet software application. For example, a user at a
software supplier in England may develop an account management program on a workstation
12 for use on a French company's computer 26 wherein the French company has a subsidiary
in Belgium running a computer 28 which must process requests from users operating
computers 30, each of which may be interfacing in a different language, such as Danish, Dutch,
French, Flemish, or German. The present invention provides character specification and
conversion capabilities to accommodate such a variety of alphabets and characters.

The following description of an assembler based preferred embodiment of the present
invention assumes familiarity with the assembly language described in "IBM High Level
Assembler for MVS & VM & VSE Language Reference, Release 3", IBM Manual Number
SC26-4940-02, and the assembler options and external function interfaces described in "IBM
High Level Assembler for MVS & VM & VSE Programmer's Guide, Release 3", IBM Manual
Number SC26-4941-02. While this preferred embodiment of the present invention is described
in the context of the IBM Assembler Language, it can apply to other language translators such
as assemblers, compilers, and interpreters.

The invention concerns the creation of Unicode data, not its processing. The invention
may be described in three ways:

A. Methods for specifying the types of constants whose character values are to be
   converted to Unicode;

B. Methods for specifying which code page or pages are used for specifying the character
   encodings used in the source program for writing the character strings to be converted
   to Unicode; and

C. Methods that can be used to perform conversions from SBCS, mixed SBCS/DBCS,
   and pure DBCS character strings to Unicode.

In the following descriptions, the terms "source" or "source string" refer to the
characters to be converted to Unicode, and "source code page" refers to the particular encoding
used to represent the source-string characters as numeric quantities. Similarly, the terms
"target" or "target string" refer to the set of Unicode characters into which the source string is
being converted.

Standard Syntax

The terminology of the IBM Assembler Language is used to specify character constants.
The DC ("Define Constant") instruction directs the Assembler to convert the characters
enclosed in apostrophes specified in the operand field to the proper machine language
representation:

    DC C'...SBCS characters...'             Convert '...SBCS characters...' to the
                                            proper machine language representation.

    DC C'...SBCS and DBCS characters...'    Convert '...SBCS and DBCS characters...'
                                            to the proper machine language
                                            representation. Requires that the DBCS
                                            option be specified.

    DC G'...pure DBCS characters...'        Convert '...pure DBCS characters...' to the
                                            proper machine language representation.
                                            Requires that the DBCS option be
                                            specified.

The DC (Define Constant) statement has this general form:

    label DC (DUPLICATION_FACTOR)(TYPE)(MODIFIERS)'(NOMINAL_VALUE)'

where the parentheses simply delimit the various fields, and are not part of the syntax of the
statement. In general, only the TYPE and NOMINAL_VALUE fields are required. For
example, a statement defining a character constant could be written:

    DC C'This is a character constant'

where the TYPE is indicated by the letter C.

In the general form of the DC statement, each of the parenthesized terms has the
following meanings:

l|j 

1 S 1 • The optional DUPLICATION_F ACTOR field specifies that the constant defined by the 

M following elements should be repeated a specified number of times. For example, 

l|l DC 3C'XYZ=' 

1^ would generate a machine language constant firom the character string 

■XYZ=XYZ=XYZ- , containing three repetitions of the nominal value string 'XYZ='. 

i§ 

2. TYPE specifies the type of encoding to be created for the values specified in the NOMINAL_VALUE field. In a preferred embodiment of the present invention, types of specific interest and applicability are Types C and G, for "Character" and "Graphic" constants, respectively. Other TYPE values are used to indicate that the NOMINAL_VALUE data should be converted to machine language data representations such as binary integer, floating point, packed decimal, and others as described in the "High Level Assembler Language Reference" citation.

The TYPE specification may also include a "subtype" specification to provide additional refinements in the type of conversion to be performed. For example, the "D" type indicates that the NOMINAL_VALUE is to be converted to an eight-byte floating point representation; two subtypes are supported, such that "DH" indicates conversion to hexadecimal floating point, and "DB" indicates conversion to binary floating point.

3. MODIFIERS specify additional information to be used in creating the generated machine language constant. A preferred embodiment of the invention is primarily concerned with the "Length" modifier, which asserts the exact length required for the generated data. An additional modifier may be used for specifying code pages to be used in converting individual constants.

4. NOMINAL_VALUE is the data to be converted. A preferred embodiment of the present invention is concerned with character data in three forms: SBCS data, mixed SBCS/DBCS data, and pure DBCS data.
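The four fields of the DC operand can be illustrated with a small parser. This is only a sketch in Python, not the assembler's own parsing logic: the regular expression, the `DcOperand` record, and the function name `parse_dc_operand` are all hypothetical, and the pattern handles just a decimal duplication factor, a one-letter type, an optional one-letter subtype, an optional `L(n)` length modifier, and a quoted nominal value.

```python
import re
from typing import NamedTuple, Optional

# Hypothetical, simplified pattern for a DC operand. Real assembler syntax
# (expressions, paired apostrophes, other modifiers) is considerably richer.
_DC_OPERAND = re.compile(
    r"^(?P<dup>\d+)?"              # DUPLICATION_FACTOR (optional)
    r"(?P<type>[A-Z])"             # TYPE
    r"(?P<subtype>[A-Z])?"         # subtype (optional)
    r"(?:L\((?P<len>\d+)\))?"      # Length modifier (optional)
    r"'(?P<nominal>.*)'$"          # NOMINAL_VALUE between apostrophes
)


class DcOperand(NamedTuple):
    duplication_factor: int
    type_code: str
    subtype: Optional[str]
    length: Optional[int]
    nominal_value: str


def parse_dc_operand(operand: str) -> DcOperand:
    """Split a DC operand into the fields of the general form."""
    match = _DC_OPERAND.match(operand)
    if match is None:
        raise ValueError(f"unrecognized DC operand: {operand!r}")
    length = match.group("len")
    return DcOperand(
        duplication_factor=int(match.group("dup") or 1),
        type_code=match.group("type"),
        subtype=match.group("subtype"),
        length=int(length) if length is not None else None,
        nominal_value=match.group("nominal"),
    )


repeated = parse_dc_operand("3C'XYZ='")
# The duplication factor repeats the nominal value string.
expanded = repeated.nominal_value * repeated.duplication_factor
```

Under these assumptions, `expanded` holds the three repetitions of 'XYZ=' described for the duplication factor, and an operand such as `DH'1.5'` yields type D with subtype H.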

Literals

Literals are a convenient form of declaring a constant in the immediate context of its use. For example, to set the address of a character constant into General Purpose Register 2, a programmer may write the "Load Address" (LA) instruction thus:

LA 2,=C'A Character Constant'

where the equal sign indicates to the assembler that the following operand is a literal constant. The assembler effectively creates a hidden internal name for the constant, replaces the literal operand in the statement by the internal name, and places the text of a statement defining the constant labeled with the internal name in a designated (or default) place in the program. This saves the programmer from having to write two statements, such as:

LA 2,Char_Const
... other statements ...
Char_Const DC C'A Character Constant'

Literals can easily be supported for all constant types described in this preferred embodiment of the present invention, and will therefore not be discussed further; such support is assumed throughout.

The assembler also supports specialized forms of character-like data called "self-defining terms". These comprise decimal, binary, hexadecimal, character, and graphic (pure DBCS) forms. The values of all self-defining terms are fixed and predefined by the assembler. For example, the self-defining terms 193, B'11000001', X'C1', and C'A' are required to have identical values. For this reason, no dependence on code page specifications can be allowed for character or graphic self-defining terms, as their values would not be fixed.
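The equality of the four self-defining terms can be checked directly. The following Python sketch uses the standard `cp037` codec as a stand-in for the assembler's fixed internal EBCDIC encoding; that choice of codec is an assumption made for illustration only.

```python
# Four spellings of one value: 193, B'11000001', X'C1', and C'A'.
decimal_term = 193
binary_term = int("11000001", 2)
hex_term = int("C1", 16)
char_term = "A".encode("cp037")[0]   # EBCDIC encoding of the letter A

all_equal = decimal_term == binary_term == hex_term == char_term

# By contrast, a character that is not invariant across code pages has
# different encodings in different EBCDIC code pages, which is why a
# self-defining term cannot be allowed to depend on the code page.
exclamation_differs = "!".encode("cp037") != "!".encode("cp500")
```

The second comparison shows the problem the text describes: if C'!' depended on the code page, its value would not be fixed.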

In the IBM High Level Assembler, the syntactic character set consists of
(a) upper-case and lower-case letters;
(b) decimal digits; and
(c) the special characters: + - / * . , ( ) @ # _ & ' = blank.

The syntactically significant alphabetic character "$" ("currency symbol") is not invariant across EBCDIC code pages; the Assembler Language requires it to have encoding X'5B', or 91 decimal.

Other characters are invariant across code pages, but they are not syntactically significant:

; : ? " % < >

The invariance or non-invariance of various syntactic characters is not significant to this preferred embodiment of the present invention, other than providing a vehicle for the proper recognition of character strings to be converted to Unicode. The character set used in character data may contain SBCS and DBCS character encodings from many possible code pages without affecting the syntactic or semantic behavior of the program, because the contexts specifying data to be converted to Unicode are limited and well defined.

This preferred embodiment supports the common programming practice that source programs (symbols, operation codes, etc.) are always created using a syntactic character set, which includes those characters needed by the assembler or other programming-language translator to correctly parse and tokenize the elements of the source program, and to identify those program elements specifically requesting conversion to Unicode. Conversion-specific character data appears only in restricted contexts, between the enclosing apostrophes of the CU-type or GU-type constants described below. Text to be converted may therefore be encoded in any desired manner.

Although the preferred embodiment described uses the Extended Binary Coded Decimal Interchange Code (EBCDIC) for all but character data, this invention applies to any conventional character set used for creating programs, such as ASCII.

A. Source Data Specification Extensions for Unicode

A syntax suitable for specifying character data conversion from SBCS, mixed SBCS/DBCS, and pure DBCS representation to Unicode utilizes the constant subtype notation described above. To specify that the nominal value of character data is to be converted to Unicode, a programmer may write:

DC CU'...SBCS data...'         Convert SBCS to Unicode
DC CU'...SBCS/DBCS data...'    Convert mixed SBCS/DBCS to Unicode (requires DBCS option)
DC GU'...pure DBCS data...'    Convert pure DBCS data to Unicode (requires DBCS option)

The first of these is called "pure SBCS data" or simply "SBCS data". The second is called "mixed SBCS/DBCS data", or simply "mixed data". The third is called "pure DBCS data".

In the preferred embodiment using the IBM High Level Assembler, the second and third of these examples require that the DBCS option be specified so that mixed SBCS/DBCS data is recognized correctly, but other forms of recognition rules or syntaxes for the nominal value could also be used.
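The effect of a CU-type constant on pure SBCS data can be approximated as follows. This is a sketch under stated assumptions, not the assembler's method: Python's `cp037` codec stands in for the source code page, the generated Unicode is assumed to be two-byte big-endian (UTF-16BE), and `convert_sbcs_to_unicode` is a hypothetical name.

```python
def convert_sbcs_to_unicode(source_bytes: bytes, codec: str = "cp037") -> bytes:
    """Return two-byte (UTF-16BE) Unicode for SBCS text in the given code page."""
    return source_bytes.decode(codec).encode("utf-16-be")


# The nominal value 'Hi' as it would appear in an EBCDIC source file.
source = "Hi".encode("cp037")             # X'C889'
generated = convert_sbcs_to_unicode(source)  # X'00480069'
```

Each one-byte source character becomes one sixteen-bit Unicode character, so the generated constant is twice as long as the SBCS nominal value.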
Another language extension provided by the preferred embodiment defines a new constant type specifically for Unicode by assigning a TYPE code 'U'. Thus, a constant to be converted to Unicode may be written:

DC U'Text to be converted'

which could be equivalent in other respects to a constant of type C. If this form is chosen, an additional type letter would also have to be assigned to accommodate pure DBCS data, by analogy with the G-type constant. Because the assembler has already assigned a large range of letters for constant types, the method using a 'U' subtype described above is more economical in its use of the available set of type codes.

In converting the nominal value data to Unicode, the assembler uses the currently relevant SBCS and DBCS code pages, as derived from global options or from local AOPTIONS statement specifications, or from constant-specific modifiers, as described below.

As with other character-based data types, no particular data alignment in storage is assumed. However, since Unicode data naturally occurs in two-byte (sixteen-bit) forms, data alignment on two-byte boundaries could easily be supported if processing efficiencies indicate that doing so would be beneficial.

To simplify usage of these new constant types, the syntax of CU-type and GU-type constants preferably should be unchanged from the current language definition for C-type and G-type constants. This allows users who are familiar with existing coding styles and conventions (i.e., the syntax of C-type and G-type constants) to utilize this invention with minimal additional effort.

B. Methods for Specifying Source Code Pages

There are three levels, or "scopes", at which source code pages can be specified:

1. Global code page specifications that apply to the entire source program: these "global" specifications allow the programmer to declare the source-program code page or code pages just once. These specifications then apply to all constants containing a request for conversion to Unicode.

2. Local code page specifications that apply to all subsequent source-program statements: these "local" specifications allow the programmer to create groups of statements containing Unicode conversion requests, all of which use the same code page or code pages for their source-character encodings. For example, a program might contain statements defining messages in each of several national languages; each grouping could be preceded by such a "local" code page specification that applies to all the statements of that group, until a subsequent local specification is provided that applies to the following group.

3. Individual constant code page specifications that apply to individual constants: these allow a very detailed level of control over the source data encodings to be used for Unicode conversion. For example, if a message in one national language must contain a segment written in a different national language, each segment of the message can specify the encoding used for its characters.
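One way to picture the three scopes is as a simple precedence rule in which the most specific specification present wins. The Python sketch below assumes that reading; the function name `effective_code_page` and its parameters are hypothetical, and the CCSID values are taken from Table 1 purely as sample data.

```python
from typing import Optional


def effective_code_page(
    global_cp: int,
    local_cp: Optional[int] = None,
    constant_cp: Optional[int] = None,
) -> int:
    """Pick the code page for one constant, most specific scope first."""
    if constant_cp is not None:      # scope 3: individual constant
        return constant_cp
    if local_cp is not None:         # scope 2: local specification in effect
        return local_cp
    return global_cp                 # scope 1: global specification


# A program assembled with a global code page of 1140, containing a group of
# statements under a local specification of 1141, one of which carries its
# own constant-level specification of 1147.
only_global = effective_code_page(1140)
within_group = effective_code_page(1140, local_cp=1141)
single_constant = effective_code_page(1140, local_cp=1141, constant_cp=1147)
```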

B.1. Global Source Code Page Specification

Global source code specifications apply to all DC ("Define Constant") statements in the source program to which Unicode conversion should be applied. These global specifications would typically be specified as "options" or "parameters" presented to the Assembler at the time it is invoked or initialized, so that the Assembler can set up any needed information that will apply to the entire source program translation.

The forms that such global source code specification options may take include:

CODEPAGE(nnn)           specifies an SBCS code page
CODEPAGE(nnn,nnn,...)   specifies a set of SBCS code pages (Example 1)
CODEPAGE(nnn,sss)       specifies an SBCS code page and a DBCS code page
DBCS(sss)               specifies a DBCS code page and enables recognition of DBCS data
DBCS(sss,sss,...)       specifies a set of DBCS code pages and enables recognition of DBCS data (Example 2)
DBCS(CODEPAGE(sss))     specifies a DBCS code page and enables recognition of DBCS data

and so forth, where values such as nnn and sss are Coded Character Set IDs (CCSIDs). Combinations and variations of the above, as well as abbreviations of the keywords, are equally useful. Default code page values can also be specified at the time the Assembler is installed on the user's system, allowing Unicode translations to be specified in the program without the need for invocation or initialization options.

In addition to these "invocation" options, the preferred embodiment allows the user to specify certain options to be included in the statements of the source program, using the *PROCESS statement. Thus, any of the above option forms could be placed in the source module with a statement like:

*PROCESS CODEPAGE(nnn)

and so forth, for all possible variations. An additional capability is provided with the *PROCESS statement through the OVERRIDE(...) option, as in:

*PROCESS OVERRIDE(CODEPAGE(nnn))

With the OVERRIDE(...) option, the user can specify that no matter what CODEPAGE options are specified when the Assembler is invoked, the global CODEPAGE value or values cannot be changed from the value(s) required to produce correct conversion of the constants in the source program.

B.2. Local Source Code Page Specification

The IBM High Level Assembler provides a mechanism allowing users to make local adjustments or overrides to options that can also be specified "globally". This mechanism is the ACONTROL statement. For example, if the user wishes that the assembler not diagnose certain substring operations, the user may specify:

ACONTROL FLAG(NOSUBSTR)    (Assembler ignores possibly-invalid substring techniques)
... statements with unusual substring coding techniques ...
ACONTROL FLAG(SUBSTR)      (Assembler resumes checking substring techniques)

The ACONTROL statement can be used to specify localized controls over the source code pages to be used for converting designated forms of character data to Unicode. For example, distinct groups of statements can be converted to Unicode from separate code pages as follows:

ACONTROL CODEPAGE(nnn)
... statements with character data to be converted to
... Unicode using code page with CCSID nnn
ACONTROL CODEPAGE(mmm)
... statements with character data to be converted to
... Unicode using code page with CCSID mmm

Alternatively, if it is desired to specify multiple code pages to be used in converting constants in subsequent statements, the ACONTROL statements could be specified in alternative forms, such as:

.* Example 3
ACONTROL CODEPAGE(nnn1,nnn2,...),DBCS(CODEPAGE(sss1,sss2,...))
... statements with character data to be converted to Unicode
... using code pages selected among the nnn and sss values
.* Example 4
ACONTROL CODEPAGE(mmm1,mmm2,...),DBCS(CODEPAGE(ttt1,ttt2,...))
... statements with character data to be converted to Unicode
... using code pages selected among the mmm and ttt values

Thus, all the various formats of "global" options could be specified on ACONTROL statements.

In cases where the user wishes to revert from a local source code page specification to the global source code page specification, the following special notation may be used:

ACONTROL CODEPAGE(*)    (Revert to global source code page specifications)

Later, conversion and implementation techniques are described that involve methods that do not require direct implementation in the assembler itself, such as macro instructions and external functions. To assist such methods, the assembler can capture information from the options and/or ACONTROL statements in global system variable symbols. These system variable symbols are a method whereby the assembler can provide environmental and status information to macros and functions. In implementing conversion to Unicode data formats, the assembler can capture the designations of current code pages in system variable symbols such as:

&SYS_SBCS_CODEPAGE    current SBCS code page
&SYS_DBCS_CODEPAGE    current DBCS code page

The advantages of this increment in assembler capability will be illustrated below.

B.3. Specifying Source Code Page for Individual Constants

The most discriminating level of code page specification is at the level of an individual constant. This invention involves adding a novel modifier, P (meaning "Code Page"), to the existing syntax for specifying constants to provide information about the code page or code pages used to create the source data for the constant.

To provide code page specifications for individual constants, another novel form of modifier is introduced:

DC CUP(nnn)'...SBCS data...'

which requests that the SBCS data provided using code page "nnn" be converted to Unicode.

DC CUP(nnn,sss)'...mixed SBCS/DBCS data...'

requests that the mixed SBCS/DBCS data provided using code page "nnn" for the SBCS data and the code page "sss" for the DBCS data be converted to Unicode.

DC GUP(sss)'...pure DBCS data...'

requests that the pure DBCS data provided using code page "sss" be converted to Unicode.

The above examples demonstrate the use of an explicit numeric specification of the value of the code page modifier. It is common practice in programming languages to use symbolic forms for important numeric quantities; this invention supports this technique. For example, if the statement:

MyCodePage Equ 1148

is used to declare that the symbol "MyCodePage" is equivalent to the value 1148, then the following two statements will be treated identically:

DC CUP(1148)'Text using code page 1148'
DC CUP(MyCodePage)'Text using code page 1148'

Thus, uses of this invention are not limited to strictly numeric specification of CCSIDs in all programming contexts.

For situations where more than one SBCS or DBCS code page is currently available (as exemplified in Examples 1, 2, 3, and 4 above), individual constants could refer indirectly to one of the previously specified code pages using a special "indicator" notation to select the desired code page. For example, suppose the ACONTROL statement of Example 3 immediately preceded these constants:

DC CUP(=1)'Convert this with code page nnn1'
DC CUP(=2)'Convert this with code page nnn2'

The notations "=1" and "=2" are intended to indicate that the first and second code pages declared in the ACONTROL statement should apply to each respective constant. The choice of the "=" character is of course arbitrary, and could be any character not allowed in valid language symbols. This level of constant-specific code page specification could also be used with U-type constants, as described above. Additional modifiers (such as length) can also be supported without any modifications to the existing language rules or implementation within the assembler.
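The three ways of writing the value of the P modifier (a number, a symbol defined with EQU, or an "=n" indicator) could be resolved along the following lines. This Python sketch is illustrative only: `resolve_code_page_modifier`, the `symbols` dictionary standing in for the assembler's symbol table, and the `active_pages` list standing in for the code pages declared on the most recent ACONTROL statement are all hypothetical.

```python
from typing import Dict, List


def resolve_code_page_modifier(
    value: str, symbols: Dict[str, int], active_pages: List[int]
) -> int:
    """Resolve the text inside P(...) to a CCSID."""
    if value.startswith("="):
        # Indicator notation: "=1" selects the first declared code page.
        index = int(value[1:])
        if not 1 <= index <= len(active_pages):
            raise ValueError(f"no code page number {index} is currently declared")
        return active_pages[index - 1]
    if value.isdigit():
        # Explicit numeric CCSID.
        return int(value)
    if value in symbols:
        # Symbolic form, for example a symbol defined by an EQU statement.
        return symbols[value]
    raise ValueError(f"undefined symbol in code page modifier: {value}")


symbols = {"MyCodePage": 1148}     # MyCodePage Equ 1148
active_pages = [1140, 1141]        # sample values for nnn1 and nnn2

numeric = resolve_code_page_modifier("1148", symbols, active_pages)
symbolic = resolve_code_page_modifier("MyCodePage", symbols, active_pages)
second = resolve_code_page_modifier("=2", symbols, active_pages)
```

Because "=" cannot begin a valid symbol, the indicator form is unambiguous, which is the property the text asks of whatever indicator character is chosen.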

C. Conversion Techniques

Three alternative embodiments for implementing the conversion of source data to Unicode will be described:

1. The implementation is inherent to the assembler itself: the assembler recognizes and parses the complete syntax of the statement in which the constant or constants is specified, and performs the requested conversion.

2. The implementation is provided in the form of an external function that can be invoked by a variety of source language syntaxes. The external function can parse as little or as much of the source statement as its implementation provides, and return the converted value to the assembler for inclusion in the generated machine language of the object program.

3. The implementation is provided by the assembler's macro instruction definition facility.

Each of these implementation techniques will be illustrated below.

Mapping Tables

A key element of the conversion process is the Mapping Table. One mapping table is created for each source code page, as identified by its CCSID. Each mapping table contains the Unicode character corresponding to each single-byte or double-byte character in the specified coded character set, arranged in ascending order of the numeric encoding assigned to each source character, as illustrated in Figure 2.

A Mapping Table 280 typically consists of a fixed-length header 282 containing a number of fields identifying the table and its status, so that the assembler can verify that the correct table is being used for the requested conversion. Following the header are the Unicode characters 284, 286, ... and 288 in the exact order of the numeric encoding assigned to the corresponding source character.

Thus, the Unicode character corresponding to the source character having a numeric encoding value of 1 would be found at 286. Similarly, the Unicode character corresponding to the source character having a numeric encoding value of k would be found at 288.

A Mapping Table for an SBCS character set would typically have two-hundred-fifty-six Unicode character entries, while a mapping table for a DBCS character set could have as many as sixty-five-thousand-five-hundred-thirty-six (65536) Unicode character entries. If it is known that certain restrictions may be imposed on the range of encoding values permitted for the source characters, then the contents of the mapping tables can be optimized to take advantage of those restrictions. For example, typical DBCS character encodings do not permit assignment of numeric encoding values less than sixteen-thousand-seven-hundred-five (16705), so that mapping table entries would not be necessary for converting those encodings.

Note that for any given constant, either one or two mapping tables will be required for converting the nominal value of the constant to Unicode. For SBCS and pure DBCS data, only a single mapping table is needed; for mixed SBCS/DBCS data, two mapping tables are required: one for the SBCS data and one for the DBCS data.
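The layout of an SBCS mapping table (a header followed by 256 Unicode entries in ascending order of source encoding) can be sketched as follows. The `MappingTable` class and its header fields are hypothetical, since the text does not enumerate the actual header contents, and the entries are generated from Python's `cp037` codec only to have realistic data.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MappingTable:
    # Hypothetical header fields: enough to check that the right table is in use.
    ccsid: int
    entry_count: int
    # Unicode code points, indexed by the numeric encoding of the source character.
    entries: List[int]


def build_sbcs_mapping_table(ccsid: int, codec: str) -> MappingTable:
    """Build a 256-entry table from a single-byte codec known to Python."""
    entries = [ord(bytes([value]).decode(codec)) for value in range(256)]
    return MappingTable(ccsid=ccsid, entry_count=len(entries), entries=entries)


table = build_sbcs_mapping_table(37, "cp037")

# Conversion is a direct index: the source encoding is the position in the table.
unicode_for_a = table.entries[0xC1]       # EBCDIC 'A'
unicode_for_blank = table.entries[0x40]   # EBCDIC blank
```

Because the table is ordered by source encoding, no searching is needed; a DBCS table would be indexed the same way by the sixteen-bit value, possibly offset to skip an unused low range such as the one the text mentions.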

Table 1 illustrates typical assignments of Coded Character Set IDs (CCSIDs) commonly used for single-byte encodings of character sets in widespread use. Further details may be found in the manual "IBM National Language Design Guide, Volume 2" (manual number SE09-8002-03).

Table 1
Examples of CCSIDs for Commonly Used SBCS Character Sets

SBCS CCSID    DESCRIPTION
01140         USA, Canada, Netherlands, Portugal, Brazil, Australia, New Zealand (00037 with euro)
01141         Austria, Germany (00273 with euro)
01142         Denmark, Norway (00277 with euro)
01143         Finland, Sweden (00278 with euro)
01144         Italy (00280 with euro)
01145         Spain, Latin America (Spanish) (00284 with euro)
01146         United Kingdom (00285 with euro)
01147         France (00297 with euro)
01148         Belgium, Switzerland, International Latin-1 (00500 with euro)

Table 2 shows examples of typical code pages used for DBCS data in pure DBCS or mixed SBCS/DBCS contexts.
Table 2
Examples of DBCS Code Pages Suitable for Unicode Conversion

DBCS CCSID    DESCRIPTION
00935         Simplified Chinese (S-Chinese) Host Mixed (including 1880 UDC and Extended SBCS)
00937         Traditional Chinese (T-Chinese) Host Mixed (including 6304 UDC and Extended SBCS)
04396         Japanese Host Double-Byte (including 1880 UDC) (User Definable Characters)
09125         Korean Host Mixed (including 1880 UDC)



C.1. Assembler Implementation

To adapt the CCSID of a mapping table to a format usable by internal or operating system services to locate the required mapping table, the assembler can employ a variety of methods. One such technique uses the observation that each CCSID is sixteen bits long, and that its hexadecimal representation therefore contains exactly four hexadecimal digits. For example, CCSID number 01148 is equivalent to the hexadecimal value X'047C'. If those four hexadecimal digits are converted to character form, they can be attached to a standard prefix and used as a module name. For example, in the IBM High Level Assembler, such a module name could be created from a prefix 'ASMA' and a suffix given by the four hexadecimal digits, in this case 'ASMA047C'. This constructed name can then be used as the name of the mapping table in all service requests involving finding and loading the mapping table.

Referring now to Figure 3 and Figure 4, the flowcharts illustrate the operations preferred in carrying out the preferred embodiment of the present invention. In the flowcharts, the graphical conventions of a diamond for a test or decision and a rectangle for a process or function are used. These conventions are well understood by those skilled in the art, and the flowcharts are sufficient to enable one of ordinary skill to write code in any suitable computer programming language.
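The module-naming technique described earlier in this section (a standard prefix followed by the four hexadecimal digits of the CCSID) amounts to a one-line formatting step. The Python function name below is hypothetical; the prefix 'ASMA' and the sample value 1148 come from the text.

```python
def mapping_table_module_name(ccsid: int, prefix: str = "ASMA") -> str:
    """Build a module name from a prefix and the CCSID as four hex digits."""
    if not 0 <= ccsid <= 0xFFFF:
        raise ValueError("a CCSID must fit in sixteen bits")
    return f"{prefix}{ccsid:04X}"


name_for_1148 = mapping_table_module_name(1148)   # 'ASMA047C'
```

Since 1148 decimal is X'047C', the result matches the example in the text; the resulting eight-character name also fits the conventional length of a load module name.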

Referring first to Figure 3, the conversion proceeds as follows. After the start 302 of the conversion program, the assembler establishes at process block 304 the code page or code pages used in the source text for specifying the nominal value of the data to be converted to Unicode. Thereafter, at decision block 306, the assembler determines whether the mapping tables needed for converting source data written in the source-data code pages are currently available. If they are, the assembler proceeds to process block 314 to begin the conversion process. Otherwise, if the mapping tables needed for the conversion are not currently available, then the assembler at process block 307 uses standard operating system services to load the appropriate mapping table. Thereafter, decision block 308 determines if the load of the mapping table was successful. If for any reason the loading process fails, then the assembler at process block 310 issues appropriate error messages and terminates its attempt to convert the constant 312.

Returning now to process block 314, to begin the conversion process the assembler parses the source string to determine the number of characters it contains. These source characters can be SBCS or DBCS characters. The number of these characters is assigned to the variable NCS. Then, at process block 316, the assembler sets a counter "K" for characters from the source string to 1. Thereafter, assembler process block 318 extracts the K-th character from the source string. Using the binary value of the character (which will be an 8-bit value for SBCS characters, and a 16-bit value for DBCS characters), assembler process block 320 extracts from the mapping table the Unicode character whose position corresponds to that binary value. This extracted value is then stored in the K-th position of the target string, as illustrated in Figure 2.

After each Unicode character is stored in the target string, assembler process block 322 increases the value of K by one, and its new value is then compared to the number of characters NCS by decision block 324. If the value of K does not exceed the value of NCS, program control is returned to process block 318 to obtain and convert the next source character. If the value of K exceeds the value of NCS, then conversion of the constant is complete, and the Unicode character string is placed in the machine language of the object file for the program by process block 326. Thereafter, the program ends at process block 312.

The process of selecting source-string characters in process steps 316 through 324 of Figure 3 is described in greater detail in Figure 4 to show how SBCS and DBCS source characters are selected. The source string is assumed to have previously been validated for syntactic and semantic correctness. After the start 402 of the scanning process, the scanning process is initialized 404 by setting a scan pointer to the address of the first byte of the source string, the nominal value of the constant. Initialization also sets a binary switch to indicate that the scan will proceed initially in "Single-Byte" mode. This switch is also used to determine which Mapping Table (SBCS or DBCS) should be used to translate source characters to Unicode.

Thereafter, the byte pointed to by the scan pointer is checked by decision step 406 to see if it is a "Shift-Out" character, indicating the start of a DBCS string. If the character is not a Shift-Out character, program control proceeds to process step 408 which determines that the source characters are part of an SBCS character set. Process step 408 also uses the source character pointed to by the scan pointer as the index into the SBCS Mapping Table, as indicated in process step 320 of Figure 3, to perform the translation of process step 410 which translates the source character to Unicode. Thereafter, process step 412 increments the scan pointer by one byte to point to the next byte of the source string. Decision step 414 then determines if the scan pointer now points past the end of the source string. If the scan pointer now points past the end of the source string, then the translation is complete, process step 416, and the assembler resumes normal statement processing, process step 418.

Returning now to decision step 406, if the byte pointed to by the scan pointer is a "Shift-Out" character, then control proceeds to process step 420 which increments the scan pointer by one byte, effectively discarding the "Shift-Out" character. The binary switch described at process step 404 is also set by process step 420 to indicate DBCS mode, thereby allowing selection of the current DBCS Mapping Table to perform the translation as illustrated in Figures 2 and 3. Thereafter, process step 422 uses the two bytes pointed to by the scan pointer as the source character. Process step 424 then translates this source character to Unicode using the DBCS Mapping Table. After the translation of the DBCS two-byte character, control proceeds to process step 426 which increments the scan pointer by two bytes to step over the DBCS source character just translated. Decision step 428 tests the following byte to determine if it is a "Shift-In" character, which would indicate the end of the DBCS portion of the source string. If the tested byte is not a "Shift-In" character, then program control returns to process step 422 to process the next DBCS source character. Otherwise, if the byte tested by decision step 428 is a "Shift-In" character, then program control proceeds to process step 430 which resets the binary switch to indicate that SBCS mode is now active. Thereafter, program control passes to the previously described process step 412 which increments the scan pointer by one byte, effectively discarding the "Shift-In" character.
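The scanning process of Figure 4 can be sketched as a loop over the source bytes. The Python below is a sketch under several assumptions: Shift-Out and Shift-In are taken to be the usual EBCDIC values X'0E' and X'0F', the mapping tables are plain dictionaries, the names are hypothetical, and the DBCS table entry in the usage example is invented test data rather than a real code page assignment. As in the text, the source string is assumed to have been validated already, so no checks for unpaired shift codes are made.

```python
from typing import Dict, List

SHIFT_OUT = 0x0E   # assumed EBCDIC Shift-Out: start of a DBCS string
SHIFT_IN = 0x0F    # assumed EBCDIC Shift-In: end of a DBCS string


def scan_mixed_string(
    source: bytes, sbcs_table: Dict[int, int], dbcs_table: Dict[int, int]
) -> List[int]:
    """Translate validated mixed SBCS/DBCS bytes to Unicode code points."""
    target: List[int] = []
    pointer = 0                                    # step 404: start of string, SBCS mode
    while pointer < len(source):                   # decision step 414
        byte = source[pointer]
        if byte != SHIFT_OUT:                      # decision step 406
            target.append(sbcs_table[byte])        # steps 408 and 410
            pointer += 1                           # step 412
            continue
        pointer += 1                               # step 420: discard Shift-Out, DBCS mode
        while True:
            pair = (source[pointer] << 8) | source[pointer + 1]   # step 422
            target.append(dbcs_table[pair])        # step 424
            pointer += 2                           # step 426
            if source[pointer] == SHIFT_IN:        # decision step 428
                break
        pointer += 1                               # steps 430 and 412: SBCS mode, discard Shift-In
    return target


# Made-up tables: two SBCS letters and one DBCS character.
sbcs_table = {0xC1: 0x0041, 0xC2: 0x0042}
dbcs_table = {0x4481: 0x3042}
mixed = bytes([0xC1, SHIFT_OUT, 0x44, 0x81, SHIFT_IN, 0xC2])
code_points = scan_mixed_string(mixed, sbcs_table, dbcs_table)
```

The inner loop plays the role of the binary switch: while it is running the scan is in DBCS mode and consults the DBCS table, and leaving it returns the scan to SBCS mode. Note that, following the flowchart as described, an empty DBCS string (Shift-Out immediately followed by Shift-In) is not handled and would have to be rejected by the earlier validation.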

C.1.1. Length Modifiers

The Length modifier is supported by the assembler for most constant types. For character constants, it is written in the form:

DC CL(m)'This is a Character Constant'

where the generated machine language object code for the constant is required to have length exactly "m" bytes. This means that the character string in the nominal value field could either be truncated (if m is smaller than the length of the nominal value string), or padded on the right with blanks (if m is larger than the length of the nominal value string). In the case of Unicode constants the implementation may or may not require that any length modifiers of the form:

DC CUL(m)'...'
DC GUL(m)'<...>'

must evaluate to even values of "m". If "m" is odd (indicating that a Unicode character does not contain the expected 16 bits), a diagnostic may be given and corrective action may be taken.
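One possible treatment of the Length modifier for Unicode constants is sketched below in Python. The text leaves the policy open, so everything here is an assumption: the function name is hypothetical, an odd length is simply rejected with an error (one form of "diagnostic"), and padding uses the two-byte Unicode blank U+0020 in big-endian form by analogy with blank padding of ordinary character constants.

```python
def apply_unicode_length_modifier(unicode_bytes: bytes, m: int) -> bytes:
    """Truncate or pad two-byte Unicode data to exactly m bytes."""
    if m % 2 != 0:
        # Assumed policy: treat an odd length as an error rather than repair it.
        raise ValueError("length of a Unicode constant must be an even number of bytes")
    if len(unicode_bytes) >= m:
        return unicode_bytes[:m]                  # truncate on the right
    blank = " ".encode("utf-16-be")               # X'0020'
    missing_characters = (m - len(unicode_bytes)) // 2
    return unicode_bytes + blank * missing_characters


truncated = apply_unicode_length_modifier("AB".encode("utf-16-be"), 2)
padded = apply_unicode_length_modifier("A".encode("utf-16-be"), 6)
```

Requiring an even value of m keeps truncation from splitting a sixteen-bit character in half, which is the concern the text raises.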

C.2. Implementation Using Macro Instructions

Many assembler programs support some form of "macro-instruction" capability that allows the programmer to create new capabilities whose invocations resemble ordinary instructions.

If C.2.1. Macro Instruction to Perform Basic Checking 

The most trivial level of Unicode support could be a macro instruction whose argument
is a character string of hexadecimal digits, in which the user has manually encoded the
representation of each Unicode character. The primary function of such a macro could be to
validate that the argument string contains a multiple of four hexadecimal digits, corresponding
to an integral number of Unicode characters, and that each group of four hexadecimal digits
corresponds to a true Unicode character. For example, a DCUX macro instruction could be
written such that the user might write:

        DCUX X'...hex data...'

or

        DCUX '...hex data...'

or

        DCUX ...hex data...

and the macro could verify that the number of hexadecimal digits is a multiple of 4, and that
the Unicode characters are valid.
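
The two checks can be sketched as follows. The function name check_dcux is hypothetical, and rejecting only the noncharacters U+FFFE and U+FFFF is a stand-in for whatever validity test an implementation would apply to decide that a group is "a true Unicode character".

```python
import string

def check_dcux(hex_digits: str) -> list:
    """Validate a hex-digit argument and return its 16-bit Unicode values."""
    digits = hex_digits.strip()
    if not digits or len(digits) % 4 != 0:
        raise ValueError("number of hexadecimal digits is not a multiple of 4")
    if any(ch not in string.hexdigits for ch in digits):
        raise ValueError("argument contains a non-hexadecimal character")
    units = [int(digits[i:i + 4], 16) for i in range(0, len(digits), 4)]
    for unit in units:
        if unit in (0xFFFE, 0xFFFF):   # assumed validity rule, illustration only
            raise ValueError(f"{unit:04X} is not a valid Unicode character")
    return units
```

A call such as check_dcux("00410042") yields the two values for "A" and "B", while a three-digit argument is rejected.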

C.2.2. Macro Instruction to Perform Checking and Conversion

A more powerful technique for supporting the conversion of character data to Unicode
characters is to create a macro definition with internal logic that performs a mapping similar to
that illustrated in Figures 3 and 4. Implementation of such a macro definition could also
include any needed mapping tables within the body of the definition.


An advantage of using macro instructions is that they utilize the existing facilities of the
assembler, and therefore do not require changes to the internal operation of the assembler.
Their primary disadvantage is that macro definitions must be executed interpretively when
invoked, so they are slower than the same function implemented "natively" in the internal logic
of the assembler. They also require extra coding for each additional code page being
supported. Thus, macro instructions provide an excellent means for testing and validating
conversion concepts, as well as a rapid development tool for situations where generality and
speed are not critical.


In a typical implementation, a macro instruction would be defined in such a way that its
arguments include a character string to be converted to Unicode, and an implicit indication
(using the system variable symbols described above) or explicit indication (by providing a
descriptive argument) of the CCSID of the code page in which the character string is
represented. The macro instruction would then directly generate the machine language
constant containing the Unicode data.

There are many ways to use macros for Unicode conversions. To illustrate, suppose the
following syntax is defined:

        DCU '...character data...',CODEPAGE=nnn,DBCS=sss

where the three operands have these meanings:

1. The first operand, '...character data...', consists of the character data to be converted to
   Unicode, enclosed in quotation marks recognizable by the macro processor.

2. The second operand, CODEPAGE=nnn, specifies the SBCS code page used for
   encoding the first operand. If omitted, this operand would imply a default value for the
   code page.

3. The third operand, if present, indicates that the first operand contains either mixed
   SBCS/DBCS or pure DBCS data, and provides the code page in which the DBCS data
   is encoded. If omitted, this operand would imply that the first operand contains only
   SBCS data.

An implementation of such a macro instruction which may be used to create Unicode
character constants is illustrated in Figures 5, 6, and 7. It does not support the third operand
described above, but is intended to illustrate how a macro instruction can be used for Unicode
conversions. The macro uses the default code page with CCSID 500, the same as that used by
the assembler for its syntactic character set plus other invariant characters. Extending the
macro to accept other code pages is straightforward.
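
For readers who do not use the assembler's macro language, the table-driven approach of the macro in Figures 5, 6, and 7 can be approximated in Python. This is a sketch, not a translation of the macro: the names VALID, build_table, and dcu are hypothetical, Python's cp500 codec stands in for the CCSID 500 tables, and the pairing of apostrophes and ampersands is not modeled.

```python
# Characters the sketch accepts, mirroring the macro's list of valid characters.
VALID = ("0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"
         "abcdefghijklmnopqrstuvwxyz_@#$%&*()-+=,./:;'<>\"? ")

def build_table(codec: str = "cp500") -> dict:
    """Map each supported source byte to four hex digits of its Unicode value."""
    return {ch.encode(codec)[0]: f"{ord(ch):04X}" for ch in VALID}

def dcu(source: bytes, table: dict) -> tuple:
    """Return (generated DC statement text, list of warning messages)."""
    digits, warnings = [], []
    for pos, byte in enumerate(source, start=1):
        if byte in table:
            digits.append(table[byte])
        else:
            # Like the macro's severity-4 message: substitute a Unicode blank.
            warnings.append(f"unknown character at position {pos} converted to blank")
            digits.append("0020")
    return "DC X'" + "".join(digits) + "'", warnings
```

Supporting another code page in this sketch only requires building a second table with a different codec name, which parallels the remark that extending the macro is straightforward.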


C.3. Implementation Using External Functions

The High Level Assembler supports a powerful capability for calling externally
provided functions that can perform a variety of processing operations. Using an external
function requires defining the function in such a way that the assembler can locate and call it
during the assembly process, passing data supplied by the program to the function, and
receiving values returned by the function. In the context of converting character data to
Unicode, a call to such a function could take a form such as the following:

&Returned_Val SetCF 'Ext_Func','character data','other_parameters'

The symbol "&Returned_Val" is where the called function "Ext_Func" places its computed
value as calculated from the other arguments. These arguments would typically include the
character string to be converted to Unicode, the code page or code pages used in coding the
character data, and any other values that might be useful to the external function. In practice,
the returned value would normally be substituted into a character or hexadecimal
constant, which the assembler would then map directly into the machine language form of the
Unicode constant.

An external function has much of the flexibility of the assembler itself: it can access
mapping tables as needed, as well as any other services of the operating system environment in
which the assembler itself is executing. Any error conditions can be reported to the assembler
using a message passing interface.

Further implementations of Unicode conversions could use a mixture of macro
instructions and external functions, in such a way that the user writes a statement such as the
following:

        DCUNI '..character data..',CODEPAGE=37

and the macro instruction could then pass the character data, the code page CCSID, and any
other useful or necessary information to an external function to perform the required
conversion. It could also generate the character or hexadecimal constant directly, in such a way
that the above DCUNI instruction appears to be "native" to the assembler itself. Other
implementations using external functions are of course possible.
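
The division of labor between a macro front end and an external function might look like the following Python sketch. Everything here is hypothetical scaffolding: ext_func stands in for the external function, dcuni for the macro, and the CCSID-to-codec dictionary uses Python's cp037 and cp500 codecs as convenient substitutes for real mapping tables.

```python
CODECS = {37: "cp037", 500: "cp500"}   # assumed CCSID-to-codec mapping

def ext_func(data: bytes, ccsid: int) -> str:
    """Stand-in external function: return hex digits of the UTF-16 form."""
    if ccsid not in CODECS:
        raise ValueError(f"code page {ccsid} not supported")
    text = data.decode(CODECS[ccsid])
    return text.encode("utf-16-be").hex().upper()

def dcuni(data: bytes, **operands) -> str:
    """Stand-in macro: read the CODEPAGE operand and emit a hex constant."""
    ccsid = int(operands.get("CODEPAGE", 500))
    return f"DC X'{ext_func(data, ccsid)}'"
```

The same source byte can map to different Unicode characters under different code pages, which is why the CCSID is passed along: in this sketch, byte 0x4A converts to 00A2 under CCSID 37 and to 005B under CCSID 500.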

UTF-8

UTF-8 is a special version of the Unicode representation, chosen for its suitability for
transmission over communication protocols designed for eight-bit characters. These protocols
are sensitive to specific eight-bit codes (such as control characters) that could appear in a
stream of valid sixteen-bit Unicode characters, and the transmission of normal Unicode data
would very likely be distorted. To avoid this problem, the Unicode standard defines UTF-8 as
a reversible mapping of sixteen-bit Unicode characters to a special string of one to four eight-bit
bytes, such that none of the eight-bit bytes have special meanings to transmission protocols.

The assembler could easily provide conversion of SBCS, mixed SBCS/DBCS, and pure
DBCS data to the UTF-8 representation, thus avoiding the need for possibly expensive run-time
conversions for each item of Unicode data being transmitted. In terms of the previous
discussion:

1. Specification of a request for UTF-8 conversion can be provided globally, using an
   assembler option such as UTF8, or an operand on a *PROCESS statement such as:

        *PROCESS UTF8

2. Specification of a request for UTF-8 conversion can be provided locally, using an
   operand of the ACONTROL statement, such as:

        ACONTROL UTF8

3. Specification of a request for UTF-8 for an individual constant can be provided by a
   modifier, such as:

        DC CUTF8'Characters to be converted to UTF-8'

Thus, it can be seen that all of the methods described above for specifying the scope of
conversion to Unicode can be applied to the requirements for conversion to the UTF-8
representation. It should be noted that UTF-8 data is not required to occupy an even number of
eight-bit bytes, so that possible checks and diagnostics for an even number of bytes would not
apply. However, in situations where a length modifier causes improper truncation of a UTF-8
byte string, a diagnostic would be appropriate.
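
The truncation concern can be made concrete with a short Python sketch. The function name truncate_utf8 is hypothetical, padding for lengths larger than the data is deliberately left out, and raising an error is only one way to surface the diagnostic.

```python
def truncate_utf8(data: bytes, m: int) -> bytes:
    """Apply a length of m bytes to UTF-8 data, rejecting a split sequence."""
    cut = data[:m]
    try:
        cut.decode("utf-8")          # fails if a multi-byte sequence was cut
    except UnicodeDecodeError as exc:
        raise ValueError(f"length {m} truncates a UTF-8 byte sequence") from exc
    return cut

# An odd total length is perfectly legal for UTF-8 data:
sample = "A\u00e9".encode("utf-8")   # 'A' takes one byte, U+00E9 takes two
```

Here sample is three bytes long; a length of 1 or 3 is acceptable, whereas a length of 2 would cut the two-byte sequence in half and should be diagnosed.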

Using the foregoing specification, the invention may be implemented using standard
programming and/or engineering techniques using computer programming software, firmware,
hardware or any combination or sub-combination thereof. Any such resulting program(s),
having computer readable program code means, may be embodied within one or more
computer usable media such as fixed (hard) drives, disk, diskettes, optical disks, magnetic tape,
semiconductor memories such as Read-Only Memory (ROM), Programmable Read-Only
Memory (PROM), etc., or any memory or transmitting device, thereby making a computer
program product, i.e., an article of manufacture, according to the invention. The article of
manufacture containing the computer programming code may be made and/or used by
executing the code directly or indirectly from one medium, by copying the code from one
medium to another medium, or by transmitting the code over a network. An apparatus for
making, using, or selling the invention may be one or more processing systems including, but
not limited to, central processing unit (CPU), memory, storage devices, communication links,
communication devices, servers, input/output (I/O) devices, or any sub-components or
individual parts of one or more processing systems, including software, firmware, hardware or
any combination or sub-combination thereof, which embody the invention as set forth in the
claims. User input may be received from the keyboard, mouse, pen, voice, touch screen, or any
other means by which a human can input data to a computer, including through other programs
such as application programs, databases, data sets, or files.


One skilled in the art of computer science will easily be able to combine the software
created as described with appropriate general purpose or special purpose computer hardware to
create a computer system and/or computer sub-components embodying the invention and to
create a computer system and/or computer sub-components for carrying out the method of the
invention. Although the present invention has been particularly shown and described with
reference to a preferred embodiment, it should be apparent that modifications and adaptations
to that embodiment may occur to one skilled in the art without departing from the spirit or
scope of the present invention as set forth in the following claims.

CLAIMS

I claim:

1. An article of manufacture for use in a computer system for creating a string of Unicode
characters stored in the memory of the computer system, said article of manufacture comprising
a computer-readable storage medium having a computer program embodied in said medium
which causes the computer system to execute the method steps comprising:

    creating a constant whose data type is not a Unicode data type;

    storing a string of non-Unicode characters in the constant which is stored in the
    memory of the computer;

    retrieving a specification of a code page in which the non-Unicode character
    string is encoded;

    translating the non-Unicode character string stored in the constant into a
    Unicode character string responsive to the specification of the code page; and

    storing the Unicode character string in the constant stored in the memory of the
    computer,

whereby the Unicode character string is created responsive to the entry of the non-
Unicode character string without the entry of the Unicode character string.

2. The article of manufacture of claim 1 wherein the non-Unicode character string is a single
byte character set (SBCS) string.

3. The article of manufacture of claim 1 wherein the non-Unicode character string is a pure
double byte character set (DBCS) string.

4. The article of manufacture of claim 1 wherein the non-Unicode character string is a mixed
SBCS and DBCS string.

5. The article of manufacture of claim 1 wherein the translation is performed by the computer
according to a scope, the scope specifying a portion of a computer program subject to the
translation.

6. The article of manufacture of claim 5 wherein the scope is global, the global scope
specifying that the translation applies to the entire computer program.

7. The article of manufacture of claim 5 wherein the scope is local, the local scope specifying
that the translation applies the subsequent portion of the computer program.

8. The article of manufacture of claim 5 wherein the scope is constant specific, the constant
specific scope specifying that the translation applies only to a specific constant.

9. A method of creating a string of Unicode characters stored in a memory of a computer, said
method comprising the steps of:

    creating a constant whose data type is not a Unicode data type;

    storing a string of non-Unicode characters in the constant which is stored in the
    memory of the computer;

    retrieving a specification of a code page in which the non-Unicode character
    string is encoded;

    translating the non-Unicode character string stored in the constant into a
    Unicode character string responsive to the specification of the code page; and

    storing the Unicode character string in the constant stored in the memory of the
    computer,

whereby the Unicode character string is created responsive to the entry of the non-
Unicode character string without the entry of the Unicode character string.

10. The method of claim 9 wherein the non-Unicode character string is a single byte character
set (SBCS) string.

11. The method of claim 9 wherein the non-Unicode character string is a pure double byte
character set (DBCS) string.

12. The method of claim 9 wherein the non-Unicode character string is a mixed SBCS and
DBCS string.

13. The method of claim 9 wherein the translation is performed by the computer according to a
scope, the scope specifying a portion of a computer program subject to the translation.

14. The method of claim 13 wherein the scope is global, the global scope specifying that the
translation applies to the entire computer program.

15. The method of claim 13 wherein the scope is local, the local scope specifying that the
translation applies the subsequent portion of the computer program.

16. The method of claim 5 wherein the scope is constant specific, the constant specific scope
specifying that the translation applies only to a specific constant.

17. A computer system for creating a string of Unicode characters stored in a memory of the
computer system, said computer system comprising: 

a constant whose data type is not a Unicode data type; 

a string of non-Unicode characters stored in the constant which is stored in the 

memory of the computer; 

a specification of a code page in which the non-Unicode character string is 
encoded retrievable from the memory of the computer system; 

a translator for translating the non-Unicode character string stored in the 
constant into a Unicode character string responsive to the specification of the code 
page; and 

memory for storing the Unicode character string in the constant stored in the 
memory of the computer, 

whereby the Unicode character string is created responsive to the entry of the non- 
Unicode character string without the entry of the Unicode character string. 

18. The computer system of claim 17 wherein the non-Unicode character string is a single byte
character set (SBCS) string. 

19. The computer system of claim 17 wherein the non-Unicode character string is a pure 
double byte character set (DBCS) string. 

20. The computer system of claim 17 wherein the non-Unicode character string is a mixed 
SBCS and DBCS string. 

21. The computer system of claim 17 wherein the translation is performed by the computer 
according to a scope, the scope specifying a portion of a computer program subject to the 
translation. 

22. The computer system of claim 21 wherein the scope is global, the global scope specifying
that the translation applies to the entire computer program. 

23. The computer system of claim 21 wherein the scope is local, the local scope specifying that 
the translation applies the subsequent portion of the computer program. 

24. The computer system of claim 21 wherein the scope is constant specific, the constant 
specific scope specifying that the translation applies only to a specific constant. 

ABSTRACT

METHOD OF, SYSTEM FOR, AND COMPUTER PROGRAM PRODUCT FOR
CREATING AND CONVERTING TO UNICODE DATA FROM SINGLE BYTE
CHARACTER SETS, DOUBLE BYTE CHARACTER SETS, OR MIXED
CHARACTER SETS COMPRISING BOTH SINGLE BYTE AND DOUBLE BYTE
CHARACTER SETS

Methods for specifying the types of constants whose character values are to be
converted to Unicode; for specifying which code page or pages are used for specifying the
character encodings used in the source program for writing the character strings to be converted
to Unicode; and that can be used to perform conversions from SBCS, mixed SBCS/DBCS, and
pure DBCS character strings to Unicode. A syntax suitable for specifying character data
conversion from SBCS, mixed SBCS/DBCS, and pure DBCS representation to Unicode
utilizes an extension to the conventional constant subtype notation. In converting the nominal
value data to Unicode, currently relevant SBCS and DBCS code pages are used, as specified by
three levels or scopes derived from either global options, from local AOPTIONS statement
specifications, or from constant-specific modifiers. Global code page specifications apply to
the entire source program. These global specifications allow a programmer to declare the
source-program code page or code pages just once. These specifications then apply to all
constants containing a request for conversion to Unicode. Local code page specifications apply
to all subsequent source-program statements. These local specifications allow the programmer
to create groups of statements containing Unicode conversion requests, all of which use the
same code page or code pages for their source-character encodings. Code page specifications
that apply to individual constants allow a detailed level of control over the source data
encodings to be used for Unicode conversion. The conversion of source data to Unicode may
be implemented inherently to the translator (assembler, compiler, or interpreter) wherein it
recognizes and parses the complete syntax of the statement in which the constant or constants
is specified, and performs the requested conversion. Alternatively, an external function may be
invoked by a variety of source language syntaxes which parses as little or as much of the source
statement as its implementation provides, and returns the converted value for inclusion in the
generated machine language of the object program. Alternatively, the conversion may be
provided by the translator's macro instruction definition facility.

[Figure: Mapping Table for Unicode Conversion; entries labeled "Unicode character corresponding to ..."]

[Fig. 3: Flowchart of the conversion of one constant. Recoverable box text: Begin; determine
code page or pages required for this source string's conversion; required mapping table
loaded? (Yes/No); load mapping table; issue error message; determine number of characters
(NCS) in the source string; set count of source characters (K) to 1; get K-th source character;
use binary value of K-th character to index into the chosen mapping table, to get K-th Unicode
character, and store it in "target" field; set K to K+1; is K > NCS?; conversion of this constant
completed. Reference numerals 300 through 322 appear on the drawing.]

[Flowchart of mixed SBCS/DBCS scanning (steps numbered from 402). Recoverable box text:
Begin; initialize: set scan pointer to start of nominal value in source string, set SBCS mode;
step scan pointer by 1 byte (discard the SO character), set DBCS mode; source character is the
byte at the scan pointer position; conversion using SBCS mapping table; step scan pointer by
2 bytes; set SBCS mode; step scan pointer by 1 byte; translation of source string completed;
End. Reference numerals 402, 404, and 416 appear on the drawing.]

         Title 'DCU -- a macro to generate Unicode constants'
         Macro
&L       DCU   &A,&Pair=Yes,&CodePage=500
.*
.*  Expected argument: an apostrophe-delimited string of one or
.*  more EBCDIC characters, with paired internal apostrophes and
.*  ampersands. The pairing is preserved in the output string if
.*  &Pair=Yes, and is not if &Pair=No.
.*  Initial limitation: max of 63 characters in quoted argument,
.*  except for paired characters if &Pair=No.
.*
.*  Declare variables used internally
.*
         LclC  &M(256)            Mapping and validation table
         LclC  &V                 Valid EBCDIC characters
         LclC  &R                 Result Unicode string
         LclB  &P                 True if '& pairs retained in output
         LclA  &J                 Counter
         LclA  &N                 Temp
.*
.*  Validate macro arguments
.*
         AIf   (N'&SysList gt 0).V1      Check for argument
         MNote 8,'DCU -- No argument.'
         MExit
.V1      AIf   (N'&SysList lt 2).V2      Check single argument
         MNote 8,'DCU -- More than one argument.'
         MExit
.V2      AIf   (K'&A ge 3).V3
         MNote 8,'DCU -- argument too short, or badly formed.'
         MExit
.V3      AIf   ('&A'(1,1) eq '''' and '&A'(K'&A,1) eq '''').V4
         MNote 8,'DCU -- argument not properly quoted.'
         MExit
.V4      AIf   ('YES' eq (Upper '&Pair')).V5  Check if pairing wanted
         AIf   ('NO' eq (Upper '&Pair')).V6   Check if no pairing
         MNote 8,'DCU -- invalid value of &&Pair.'
         MExit
.V5      ANop  ,
&P       SetB  1                  Indicate '& pairs retained in output

Fig. 5 (reference numeral 500)

.V6      ANop  ,
         AIf   ('&CodePage' eq '500').V7  Check code page
         MNote 8,'DCU -- Code Page &CodePage not supported yet.'
         MExit
.V7      ANop  ,
.*
.*  Arguments validated. Set SBCS and Unicode character sets
.*
.VX      ANop  ,                  Set up mapping table
&J       SetA  &J+1
&M(&J)   SetC  ''                 Initialize to null
         AIf   (&J lt 256).VX     Loop for all 256 code points
.*
&V       SetC  '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrs*
               tuvwxyz_@#$%&&*()-+=,./:;''<>"? '
.*
.*  The following is the conversion table from CCSID 500 to Unicode
.*
&U       SetC  '303132333435363738394142434445464748494A4B4C4D4E4F50515*
               2535455565758595A6162636465666768696A6B6C6D6E6F707172737*
               475767778797A5F4023242526262A28292D2B3D2C2E2F3A3B273C3E2*
               23F20'
.*
.*  Note: Conditional-assembly string constants require paired
.*  apostrophes and ampersands; ampersands are not reduced to a
.*  single character internally. Thus, the encoding for & appears
.*  twice in the &U encoding string above.
.*
&J       SetA  1
.*
.*  Build the EBCDIC-to-Unicode mapping table
.*
.VY      ANop
&C       SetC  (Double '&V'(&J,1))  Pick character from valid string
&C       SetC  'C''&C'''          Character in self-defining term
&N       SetA  &C+1               Convert to numeric
&M(&N)   SetC  '&U'(2*&J-1,2)     Put Unicode digits in mapping table
&J       SetA  &J+1               Increment &J
         AIf   (&J le K'&V).VY    Set up all valid encodings

Fig. 6 (reference numeral 600)

.*
.*  Convert each SBCS argument character to Unicode equivalent
.*
&J       SetA  2                  Start after initial apostrophe
.*
.Z       ANop  ,                  Head of translation loop
&C       SetC  '&A'(&J,1)         &J-th character from argument
         AIf   ('&C' ne '''' and '&C' ne '&&'(1,1)).Z1  Is it '& ?
         AIf   (&P).Z1            Have '&, is pairing wanted?
&J       SetA  &J+1               No pairing, step input by one
.Z1      ANop  ,
&C       SetC  (Double '&C')      Pair '& for self-defining term
&C       SetC  'C''&C'''          Change to arithmetic value
&N       SetA  &C+1               Convert to numeric
&C       SetC  '&M(&N)'           Get Unicode mapping
         AIf   ('&C' ne '').Z2    Validly mappable if not null
         MNote 4,'DCU -- Unknown character at position &J converted to *
               blank.'
&C       SetC  '20'               Unicode blank
.Z2      ANop  ,
&R       SetC  '&R.00&C'          Add new character to end
&J       SetA  &J+1
         AIf   (&J lt K'&A).Z     Repeat for all internal characters
.*
.*  Generate the requested Unicode constant
.*
&L       DC    X'&R'
         MEnd

Fig. 7 (reference numeral 700)